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1. Introduction 

Cluster analysis is related to the concept of outliers. If a part of a dataset 
forms a well separated cluster, this means that the other points of the dataset 
appear outlying with respect to the cluster. It may be interpreted synonymously 
that "a cluster is homogeneous'* and that "it does not contain any outlier". The 
idea of Fixed Point Cluster (FPC) analysis (Hennig 1997, 1998) is to formalize 
a cluster as a data subset that does not contain any outlier and with respect to 
that all other data points are outliers. 

The concept is applied to clusterwise linear regression in this paper. That 
is, a relation 

y~x!f3 + u y *(«) = 0, (1) 

between a dependent variable y and an independent variable x 6 2R P x {1} 
(/Jp-H denoting the intercept parameter) should be adequate for a single clus- 
ter. The values of x can be fixed or random. The error variable u is assumed 
as Gaussian distributed with variance a 2 independent of x. There are lots of 
applications of clusterwise linear regression, e.g. in biology (Hosmer 1974) 
and economics (DeSarbo and Cron 1988, Wedel and DeSarbo 1995, Wedel and 
Kamakura 2000). A further example is given in Section 5. 

Most of the methods for clusterwise linear regression are based on least 
squares and maximum likelihood estimation for mixture and partition models. 
For overviews see Wedel and Kamakura (2000), Hennig (1999). Viele and 
Tong (2000) treat such mixtures from a Bayesian viewpoint, Cook and Critch- 
ley (2000) develop a method to detect regression clusters graphically, Shannon, 
Faifer, Province and Rao (2002) use a tree-based approach, and Morgenthaler 
(1990) uses the local minima of redescending M-estimators to find such clus- 
ters. Morgenthaler's approach is an ancestor of FPC analysis, as explained in 
Section 2. 

To illustrate the idea of FPC analysis, consider the artificial dataset of 
Figure 1. From the viewpoint of robust statistics, the dataset can be interpreted 
as a main part of 90 points (circles), Gaussian distributed along a linear regres- 
sion line, and 11 outliers. Davies and Gather (1993) emphasize that the term 
"outlier*' should be defined with respect to a reference distribution. That is, the 
shape of the "good" points has to be assumed to define what a "bad" point is, 
and the region of outliers can be defined e.g. as a region where the density of 
the reference distribution is low. In this sense, the 1 1 points indicated by tri- 
angles and diamonds are outliers with respect to the distribution underlying the 
data subset of circles, and the circles are the set of non-outliers with respect 
to themselves. However, while it is usually assumed in robust statistics that at 
least half of the points are non-outliers, we observe that the 10 points indicated 
by triangles can as well be considered as non-oudiers with respect to their own 
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Figure I . Artificial data- 



underlying linear regression distribution, in which case all remaining 91 points 
are defined as outliers. 

Such subsets are in some sense homogeneous (they do not contain any 
outlier) and well separated (all other points are outliers with respect to them) 
and therefore it is reasonable to call them "clusters". 

Note that not all data subsets are exactly the non-outliers with respect to 
themselves: If the distributions of the type (1) without further assumptions to 
the x-values are chosen as the class of reference distributions, the 1 1 non-circles 
in Figure 1 cannot all at the same time be reasonably defined as non-outliers 
unless at least some of the circles in between are non-outliers, too. 

"Consisting of the non-outliers with respect to itself can be formalized 
as a fixed point condition, which is done in Section 2. For combinatorial rea- 
sons, it is impossible to check this condition for all data subsets. Therefore, 
an implementation is needed to find with high probability all meaningful fixed 
point clusters of a dataset. Such an implementation is given in Section 3. In 
Section 4, FPC analysis is compared by means of a simulation study to an al- 
ternative procedure for clusterwise linear regression, namely the mixture model 
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ML-estimator introduced by DeSarbo and Cron (1988). Apart from FPCs, I do 
not know of any clusterwise linear regression procedure that accounts explic- 
itly for outliers, although the approach of using mixtures of ^-distributions (Peel 
and McLachlan 2000) may be easily extended to the linear regression setup. If 
the distribution of the independent variable is not too far from a Gaussian dis- 
tribution, it is possible to apply the MCLUST-package discussed by Fraley and 
Raftery (1998 t 1999), which performs an ML-estimation of a mixture model of 
multivariate Gaussian distributions and allows for noise modeled by a Poisson 
process component. The package is suggested for 'linearly shaped clusters" by 
DasGupta and Raftery (1998), and I included it in the simulation study (it is 
not claimed that FPC analysis as defined here is a competitor with MCLUST 
in non-regression setups). Other packages for Gaussian mixtures such as EM- 
MIX by McLachlan et al. (1999), which also fits more robust ML-estimators 
for mixtures of ^-distributions, could be used as well. It should be empha- 
sized, however, that in clusterwise linear regression data subsets are interpreted 
as homogeneous groups if they follow a common linear relation between the 
independent and the dependent variable, while clustering by Gaussian or t- 
mixtures attempts to divide such groups if the independent variable alone is 
non-homogeneous. 

FPC analysis is not as good as the ML methods if their model assump- 
tions are fulfilled exactly, but it turns out to be clearly superior to regression 
mixture ML under the occurrence of outliers and to MCLUST under strongly 
non-Gaussian independent variables. Furthermore, overlapping FPCs may give 
some additional insights in the data structure. In Section 5, FPC analysis is 
applied to a dataset concerning tone perception. 

While the simulation study treats the recovery of Gaussian mixture com- 
ponents under the occurrence of outliers, FPC analysis is essentially not a 
method to estimate mixture components. FPCs can be interpreted as estimators 
for theoretical FPCs of populations (as defined in Section 2), which provide an 
alternative definition of a cluster in a stochastic model. Theoretical FPCs are 
treated elsewhere in greater detail (Hennig 2000a), as well as the consistency 
of data FPCs for them (Hennig 2000c). 

2* Clusters and Outliers in Linear Regression 

An outlier identification procedure is needed to formalize the idea out- 
lined in the introduction. For a reference model Pp ff2 G of the form (1), G de- 
noting a regressor distribution, the easiest outlier region in the spirit of Davies 
and Gather (1993) is defined as 



Pfafi) : = ((*■ y) € 2R" x {1} x JR : (y - x'/3) 2 > ccr 2 }, (2) 
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that is, points outside a strip of width 2y/co* around the regression hyperplane 
defined by 0 are defined as outliers. The tuning constant c can be chosen as a 
large quantile of the -distribution to get a low probability for the occurrence 
of outliers under Pp a2 Q . 

This definition is used to define an FPC with respect to a distribution P 
on iR p+1 : An FPC should be a set S which contains exactly the non-outliers 
with respect to itself. For a given distribution P and set 5 let Ps denote the 
conditional distribution of P on S. For arbitrary P let 0(P) and a 2 (P) denote 
the LS-linear regression and error variance functional, respectively (A — 0 in 
case of non-existence). 

4 For (£,<t 2 ) € 2R p+1 x 2R+ let h{P>a 2 ) == A(c,Pp ^ G ) c from (2) be 
the corresponding region of non-outliers, which does not depend on G. Now 
a set S containing exactly all non-outliers with respect to itself, i.e., all points 
close enough to the regression hyperplane defined by the parameters of Ps, is 
defined as follows: 

Definition 2,1 A set S fulfilling S = f(S), f(S) = h(g(S))> g{S) = 1fi(Psh 
a 2 (Ps)) is called a theoretical FPC of a distribution P. 

This could be written as well as a fixed point condition concerning cr 2 ), 
namely by considering hog. Examples of theoretical FPCs are given in Hennig 
(2000a). 

FPCs for datasets can be defined by replacing /3(F) and o 2 {P) by esti- 
mators. I introduce some notation first: Let(X,y) := ({xf u yi) y . . . , (x! ni y n )Y t 
where x» e Ui p x {1}, yi 6 JR, i = 1, . . . , n, be a regression dataset. Let 
Wp ^ be the diagonal matrix of indicators (weights) 

l[(yi-x;/3) 2 <ca 2 ], t = l,...,n, 

of the points lying close enough to the regression hypeiplane defined by (3. Let 
n(W) = trace(W) the number of points indicated by W. 

Definition 2.2 A diagonal matrix Wp ffJ with diagonal in {0, l} n is called 
Fixed Point Cluster Matrix (FPCM) w.r.r (X, y) (and the indicated points 
form an FPC), iff{0,c 2 ) € 2R P+1 x 11% is a fixed point of 

/: (/3,^ 2 )^(/3(w^),a 2 (w^ 3 )), 
where p(W) := (X'W'X^X'Wy, 
* 2 (W) := ^wfc^T (y " X0(W))' W (y - X0(W)) . 

In case of the non-existence of (X'WX)" 1 , o 2 ) :== 03, oo). 
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Analogously to Definition 2,1, W is required to indicate exactly the 
points close enough (in terms of the error variance estimator a 2 (W)) to the re- 
gression hyperplane defined by the regression estimator $(W). That is, to ver- 
ify the FPC property of a data subset, its regression and error variance estimator 
have to be calculated and it has to be checked if all its points (and no others) 
are inside the corresponding strip around the regression hypeiplane. This holds 
for the subset of circles as well as for the subset of triangles in Figure 1 if c is 
chosen suitably, e.g. c = 6.635, being the 0,99-quantile of the x 2 -distribution. 
The choice of c is discussed in the Sections 3.5 and 4. L Note that FPCs inherit 
the equivariance properties of the LS-regression and scale estimator. 

Most recent outlier identifiers are based on robust estimators, see e.g. 
Davies and Gather (1993), Rocke and Woodruff (1996) and Hawkins (1999), 
and it may be wondered why the outlier identifier used in Definition 2.2 is based 
on the LS-estirnator and the corresponding error variance. Indeed, it would be 
of interest to use a more sophisticated outlier identifier to define FPCs, but this 
leads to some additional theoretical and computational difficulties and is left to 
future research. On the other hand, Kosinski (1999) demonstrates that an outlier 
identifier based on calculation of non-robust mean, variance and covariance 
estimators on data subsets can have a good performance as well. 

The definition of FPCs given here is a further development from the fol- 
lowing approach of Moigen thaler (1990), which is related to robust methods: 
A redescending M-estimator is defined by 

0 £i v s t 

where p is symmetric about 0, monotone increasing between 0 and some k > 0 
and constant outside {-k, k). s is an error scale constant and has to be chosen 
in advance. A local minimum of (3) fulfills the fixed point condition 

0 = (X'W^X^X'W^y, (4) 

where W(0) is a diagonal matrix of weights Wi{0) - ^E^yJ 1 * $ = 
p\ possibly piecewise (Huber 1981, p. 183 ff.). That is, such a local mini- 
mum can be interpreted as a weighted LS-estimator, where observations with 
(Vi - xJ/9) 2 /5 2 > k 2 get a weight of zero and can be interpreted as outliers, 
because of i/> = 0 outside (— &,fc). This is why these estimators are called 
"redescending". 

Redescending M-estimators are not so popular in robust statistics, be- 
cause the non-uniqueness of the solutions of (4) poses certain theoretical and 
practical problems. However, Morgenthaler (1990) utilized this non-uniqueness 
to locate different subpopulations (clusters). Such multiple solutions can be 
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found by use of a fixed point algorithm (sometimes called "iterati vely reweighted 
least squares" in this setup) from varying starting values. The general idea to 
use solutions of the fixed point equation of redescending M~estimators for clus- 
tering goes back to Hampel (1975). 

FPC analysis as defined here uses the same idea with binary weights 
(more general weights would be possible) and defines the error scale by a 
fixed point equation analogous to (4) to account for clusters with differing error 
scales. Therefore, its solutions are no longer local minima of a global objective 
function such as (3). There is no ordering between the clusters. 

Performing a cluster analysis in such a way has two main advantages: 

1. The number of clusters does not need to be specified in advance. 

2. The parameter estimator for one cluster is not only unaffected by points 
from the other clusters (as long as they are well separated), but further- 
more the influence of observations which lie far from any linear regres- 
sion population is weighted down to zero. That is, not all points (and 
not even their majority) need to belong to linear regression clusters. This 
is different from clustering based on mixture models: While it is possi- 
ble for methods like those of DasGupta and Raftery (1998) and Peel and 
McLachlan (2000) to handle easy situations like the diamond outlier of 
Figure 1 adequately, it cannot be ruled out that the addition of further 
outliers affects the parameter estimators for the points belonging to ho- 
mogeneous subpopulations. 

A third difference to methods based on mixture or partition models is that ob- 
servations may belong to more than one cluster, which may be seen as an ad- 
vantage or as a drawback, see the discussion in Section 5. (The a posteriori 
membership probabilities from mixture model ML-estimators can be larger than 
zero for more than one cluster, but nevertheless each observation is modeled as 
coming from exactly one component in a mixture model.) In particular, all sub- 
sets lying exactly on a regression hyperplane, i.e., a 2 = 0, form FPCs unless 
the covariate points are collinear. For very small subsets, this may not be very 
interesting, but larger exactly linear portions of the data are often meaningful. 
It will be discussed later how to avoid meaningless FPCs in the output of the 
procedure. Note that the system of FPCs does not have a hierarchical structure 
in general. 

It follows directly from Definition 2.2 that outliers outside the non-outlier 
strip of an FPC never destroy its FPC property. This may be seen as a robustness 
property of FPC analysis. Unfortunately, it cannot be formalized in terms of a 
high breakdown point, because under certain circumstances, robustness prob- 
lems can be caused by points inside, but near to the border of the strip. Note, 
however, that formal robustness properties even of the simplest cluster analysis 
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procedures are difficult to obtain, see Garcia-Escudero and Gordaliza (1999), 
Kharin (1996, Section 7.6). 

The relation between clustering and outliers was considered recently as 
useful for the sake of outlier identification as well, see Rocke and Woodruff 
(1996,2001). 

3. Implementation 

3.1 A Fixed Point Algorithm 

It is practically impossible to check the FPC-property of every subset of 
a dataset, except if it is very small. But FPCMs can be found by means of a 
fixed point algorithm. Its convergence is shown as Theorem 3,1. 
Fixed point algorithm (FPA): Choose a diagonal indicator matrix W° with 
n(W°) >p + l, fc = 0. 

Stepl Compute 0(W*),a 2 (W*). 

Step 2 W** 1 » diag (l[( yi - x#(W*)) 2: < c<x 2 (W*)]). 

Step 3 End if W* = W*+\ else k = k + 1, step 1 . 

Theorem 3.1 Let c> I //(X'WX)- 1 exists whenever n(W) > p + 1, then 
for some k < oo : W* = W* +1 f Le., the FPA converges in finitely many steps. 
The proof is given in the appendix. 

3.2 Outline of the Implementation 

Because more than one FPC is of interest in cluster analysis, it does not 
suffice to run the FPA only once. The rest of Section 3 is devoted to the devel- 
opment of a procedure which should be able to discover all substantial FPCs 
with high probability, where "substantial" means all FPCs corresponding to 
well separated, not too small data subsets which can adequately be described 
by linear regression distributions of type (1). The implementation involves the 
choice of some constants, which is discussed in detail. A short description of 
the implementation is given in Section 3.6. The basic procedure is simple: 

1. Choose the number of algorithm runs i n>P and the tuning constant c. 

2. Repeat i nj > times: Generate an indicator matrixW 0 with n(W°) = p + 2 
randomly and apply the FPA until convergence. The choice of n(W°) is 
justified in Section 3.3. Store all found FPCMs and count the number of 
times that each FPCM has been found. 
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The choice of i n # and c is discussed in the Sections 3.3 and 3.5. 

Applying the basic procedure, one may observe that the number of found 
FPCs is often larger than one would like to have for the clarity of the interpre- 
tation, unless n is very large or i njP is so small that the result of the analysis 
depends strongly upon chance. There are some reasons for that: 

• Often more than one version of the same visible cluster in the data mani- 
fests itself as an FPCM, because FPC analysis allows an arbitrarily large 
overlap between the clusters. 

• Often there appear small subsets of a dataset that can be fitted almost 
exactly by a regression hyperplane. Because their separation from the 
remaining points is measured by their very small error variance, they can 
meet the FPCM definition. If i UtP is large enough, some algorithm runs 
lead to very small FPCMs by chance. 

• If the dataset is sparse in iR p+1 , i.e., if n is small or p is large, it is not 
very difficult for subsets to be separated well enough from the rest of the 
data and thus to be FPCMs. Partitioning methods and methods based on 
mixture models suffer from lots of local optima of the criterion function 
in this situation, while FPC analysis yields lots of FPCs. 

An applicable procedure needs a third step, which will be described precisely 
in the Sections 3.4 and 3.6: 

3* Reduce the number of FPCs to be interpreted by 

• defining groups of similar found FPCMs (clusters ofclusters\ 

• discarding all FPCMs of groups that are found too seldom, 

• choosing a representative FPCM for each of the remaining groups. 

3.3 The Number of Algorithm Runs 

Suppose that a dataset of size n contains an FPCM W of size n(W), 
which is homogeneous and well separated from the rest of the data. The num- 
ber of algorithm runs should be chosen in order to find such an FPCM with 
high probability. This probability can be calculated on the base of the probabil- 
ity p w that this FPC is found by a single run of the EPA, started with n(W°) 
randomly selected points. Of course, pw cannot be determined exactly. How- 
ever, when the data can be partitioned in "good points" and outliers, subset 
based methods of estimation and outlier identification behave robustly if they 
start with a subset consisting only of good observations (see e.g. Rousseeuw 
and Leroy 1987, Kosinski 1999). If the points of W are considered as "good", a 
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Table 1: Number of algorithm runs t n ,p 
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starling configuration with points only from W will lead to a relatively reliable 
non-outlier region. The FPA starting from such points will usually lead to W 
or a very similar FPC (which may exist), if W is well separated enough and 
does not have significantly separated subpopulations (in which case these sub- 
populations would be of main interest). Consequently, under the circumstances 
mentioned above, at least the probability of finding a member of the Single 
Linkage cluster of similar FPCs of W (see Section 3,4) can be approximated 
reasonably by 

/n(W)x 

9n(W) ~ / n V (5) 

The minimal possible n(W°) = p + 2 maximizes g n (w) and * s therefore rec- 
ommended. It cannot be expected to find FPCs with too small n(W)/n in 
a feasible number of algorithm runs. Thus, a decision is necessary about the 
smallest size n^n of an FPCM that one wants to find with high probability. 

i nJ) can be chosen so that the approximated probability to start with p + 2 
points from an FPC with n m i n points at least i m \ n times is at least 0.95: 

, := min{t : QB(i,q nmin ; 0.05) < t min }, (6) 



where QB(n,p;a) denotes the a-quantile of the Binomial(n,p)-distribution 
(compare Appendix B of Kosinski 1999). 

While tmin = 1 minimizes i njh I suggest to take a larger value of i m in» 
say i m in — 3, as long as the computation time allows, because this makes the 
result of the analysis more stable. Table 1 gives some values of i niP for rimm == 
^. Unfortunately, i n # increases exponentially in p. It decreases moderately in 
n. The computation time for a single algorithm run increases in p and n, but 
apparenUy not very fast. However, p > 4 is not feasible at the moment, except 
with manually chosen cluster candidates as starting configurations. 



3.4 Reduction of the Number of FPCs 



Usually some of the inj? algorithm runs lead to FPCs which are very 
similar to others or not very stable. The number of FPCs can be reduced by 
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defining representative FPCs for groups of similar FPCs, and by discarding 
apparently unstable FPCs. 

The agglomeration of groups of FPCs requires a similarity measure be- 
tween FPCs. Similarity between FPCs should be defined by means of the num- 
ber of common data points and not by means of their regression and error 
variance parameters to keep the equivariance properties of FPCs. A similar- 
ity measure between the indicator vectors v and w of subsets of a dataset (main 
diagonals of indicator matrices V, W, respectively) is defined by relating the 
number of points of the intersection of the subsets to the sum of their sizes: 

( 2v w 

so that 0 < 5*(v,w) < 1, where s*(v y vt) = 0 iff v and w have no point in 
common, and s*(v, w) = 1 iff v = w. To define a partition of the FPCMs, 
one can specify 0 < s^* < 1 so that v,w are interpreted as "similar" if 
s*(v, w) > Scut^ The Single Linkage clusters of index s^t are defined as the 
connectivity components of the graph with the FPCMs as vertices and edges 
between all pairs v, w where .5*(vyw) > $ C ut< Cormen, Leiserson and Rivest 
(1990, p. 477) give an algorithm to compute these connectivity components 
without calculating the whole Single Linkage tree. This seems to be the easiest 
method to get a reasonable partition based on similarities without assumptions 
about the number of groups. I suppose Scut = 0.85, which means that two 
FPCs of 20 points each are considered as similar if they have at least 17 points 
in common. A subset of at least 16 points is considered as similar to a set of 20 
points. 

Complete Linkage clustering could as well be reasonable, but if its cluster 
index s^t is chosen moderately smaller than for the corresponding Single Link- 
age solution, it does not seem to lead to considerably different results. Chaining 
problems seem unlikely in this setup. 

For each of the groups, a representative FPCM is chosen. The ratio of the 
number of findings i w to its estimated expected value (based on the approxi- 
mation g„(w)) 

rw = A- (8) 

*n,p9n(W) 

measures the stability of an FPCM. Therefore, it is reasonable to choose the 
FPCM with the largest r\y as the representative FPCM. 

The effect of the Single Linkage reduction can be seen in Table 2 in 
Section 4.L However, my experience is that the effect of this reduction is more 
important in real data, where the tails of the error distributions are not exactly 
Gaussian and FPCs differ often by the inclusion or exclusion of single points in 
the error tail regions, compare the example in Section 5. 
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The similarity s # is used in Section 4.2 as well to measure the quality of 
the cluster recovery by the compared cluster analysis methods. 

Some FPCs can be considered as <c too unstable", namely too small ones 
and the ones found too seldom. If n(W) < n~ defined by 

n_ :~ min{n : QB(i„ >p , 0.5) < i m \ u } (9) 

then an FPCM W will be reproduced in a further analysis with an estimated 
probability of at most 0.5. Thus it should be excluded. This should be done be- 
fore the Single Linkage agglomeration to prevent that a whole group of larger 
FPCs is lost because of a small representative FPC that occurred often by 
chance. 

If an FPCM is found seldom, there can be similar FPCMs corresponding 
to the same, possibly relevant, pattern of the data. But Single Linkage groups of 
FPCMs should be excluded as well, if their members are found less than i m { n 
times. 

3.5 The Tuning Constant c 

The tuning constant c defines the size of the distance that a point must 
have from the center of a Gaussian distribution to call it an "outlier'*. The larger 
c, the more separation is needed for a data subset to meet the definition of 
FPCMs. Formally this can be seen by considering theoretical FPCs as defined 
in Definition 2.1. It can be shown (Hennig 2000c) that a single linear regression 
distribution of type (1) has a unique theoretical FPC for c > 3, which contains 
98.5% of the mass for c = X? ;l _ 0 = 6635 resulting in an error variance of 
0.9001, more for larger c and less for smaller c. If a mixture of such a distri- 
bution P with another distribution Q is considered, it depends on the amount 
of mass from Q inside of A(c> Ps) c > if there remains a theoretical FPC corre- 
sponding to P. The smaller c, the closer to P the regions of high density of Q 
may lie, so that there is still an FPC corresponding to P. Examples are given in 
Hennig (2000a). 

But these are only asymptotical considerations, and there is an important 
effect of the sample size n in dependence of the dimension p. If n is small, 
the data can be so sparse that almost all strips around arbitrary regression hy- 
perplanes contain data subsets that form FPCs because they are well separated 
enough from the rest of the data. That is, the larger p and the smaller n, the 
larger c is required to prevent the occurrence of such meaningless FPCs. This 
corresponds to Rousseeuw's (1994) words: "My interpretation of the "curse of 
dimensionality" is that several structures can exist simultaneously in the same 
dataset" Unfortunately, up to now there is no well-founded theory to relate 
n, p and c, and furthermore such a relation depends upon the other parameters 
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of the algorithm to find FPCs, which were discussed in the previous sections. 

The approach taken here is to simulate the number of found FPCs of a 
homogeneous Gaussian linear regression population for the proposed values of 
nminJmin and s^t and various values of n, p and c. The results are given 
in Section 4.1, Table 2. Formula (10) can be used to choose a value of c for 
particular values of n and p. It approximates roughly the value of c from the 
simulations so that it is as small as possible with sufficient certainty that only 
one FPC is found for a homogeneous population. More formally: In 20 simula- 
tion runs there should be at most a mean of 1.2 found FPCs, from which always 
only one is representative: 

t \ 33 2900000 >wix 

c(n ' p) = 3 + [ 2 ~(P-i)/2 n ]i/3 + {2-(?-i)/2 n] r < 10 > 

Table 2 shows the c(n, p)-values corresponding to n and p chosen for the sim- 
ulation. If the asymptotical value c a of c is desired to be larger than 3, c = 
max[c(n,p), c a ] can be chosen. 

For small n, (10) leads to enormous values for a It is not reasonable 
to demand a separation defined by more than, say, c = 70, for a clearly sep- 
arated cluster, and therefore higher values of c(n,p) indicate that data of the 
corresponding n and p are typically too sparse to apply FPC analysis reason- 
ably. However, I made good experiences with the application of (10) for larger 
values of n than those chosen in the simulations. 

3.6 A Description "Ready to Run" 

Step 1 Choose c according to (10), n min = § , i min = 3, s cut = 0.85. Of 
course, all these choices are subjective because they concern trade-offs 
between more information and better interpretability, more stability and 
lower computing time, respectively, as discussed in the previous sections. 

Step 2 Compute i n%p according to (6), n_ according to (9). 

Step 3 Repeat i niP times: Generate an indicator matrix W° with n(W°) = 
p + 2 by random and apply the FPA. Store all found FPCMs W with 
n(W) > n_, Count the number of times that each FPCM has been found. 

Step 4 Compute the similarities for each pair of FPCMs according to (7). 

Step 5 Compute the Single Linkage clusters of index s cut of FPCMs. 

Step 6 For j = l f 

ij := ]T t w . 

group(W)==; 



262 



C. Hennig 



If ij > imin, choose 

W* ;= arg max {rw}> 

group(W)=j 

r w defined according to (8). 

The Wjf, j = 1, . . . , jo, with ij > i m [ n are the representative FPCMs. 
Let rir denote its number in the following. 

It might be disappointing that the result of an FPC analysis depends upon 
chance via the random starting configurations, that the approximations of the 
probability to find a relevant FPC are rough, and that the computing time gets 
very large for large p. However, minimization of multidimensional criterion 
functions as necessary in cluster analysis based on estimators in mixture models 
leads to similar problems. There are some reasonable approaches to use only 
one good starting point instead of many random configurations (e.g. De Veaux 
1989, Fraley and Raftery 1999, Coleman and Woodruff 2000), but there is even 
less theoretical foundation in terms of the probability to discover a given cluster, 
and for large p to my knowledge there are also no simulations. 

4. Simulations 
4.1 FPCs in Homogeneous Populations and Choice of c 

In a population consisting of only one homogeneous regression compo- 
nent asymptotically only one FPC is to be expected. The simulations of this 
section are to show the relation between the number of found FPCs, n, p, and 
the tuning constant c. Here, all points - . . y x p > y) 6 2R P+1 were gener- 
ated according to a p + 1 -dimensional Gaussian distribution with mean 0 and 
covariance matrix l p +v The procedure of Section 3.6 was used. 

There were 20 simulations runs for each constellation. The number of 
found FPCs n c was recorded as well as the number of their Single Linkage 
clusters found often enough (n r ). The simulation results are given in Table 2. 

While it is obvious that the number of FPCs decreases with increasing n, 
increasing c and decreasing p, these relations do not seem to follow a simple 
functional pattern. The formula (10) was fitted by 44 trial and eiTor" to get not 
too huge values of c(n,p) which guarantee for not more than one cluster in a 
homogeneous population with high probability. The values of c(n,p) for the 
simulated values of n and p are given as well in Table 2. 

Note that because of the equivariance properties, the results generalize 
to arbitrary homogeneous linear regressions with Gaussian distributed indepen- 
dent variables. Unfortunately, arbitrarily distributed independent variables are 
not covered in general. 
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Tkble 2: Average number of representative FPCs (found FPCs) n r (n c ) for homogeneous data 



p = o 


c = 6 


c = 10 


c = 20 


c = 30 


c(n,p) 


n = 25 


3.8 (9.4) 


2.5 (3.9) 


1.5 (24) 


1.2(1.5) 


78.7 


n = 50 


2.7(6.8) 


1.3 (2.1) 


10 (1.1) 


1.0(1.0) 


19.2 


n= 100 


1.4(2.6) 


1.0(1.1) 


1.0 (1.0) 


1.0 (1.0) 


10.4 


n = 200 


1.0(2.2) 


1.0(1.1) 


1.0(1.0) 


1.0 (1.0) 


8.2 


P=l 


c = 6 


c= 10 


c = 20 


c = 30 


c(n,p) 


n = 25 


15.7(39.7) 


9.4(15.3) 


3.9 (5.2) 


2,6(2.9) 


199.9 


n = 50 


9.6(39.5) 


3.2 (10.4) 


1.4 (2.7) 


1.0(1.5) 


35.2 


n = 100 


2.6(9.7) 


1.1(1.3) 


1.0 (1.0) 


1.0(1.0) 


13.0 


n = 200 


1.2(3.0) 


1.0(1.1) 


1.0 (1.0) 


1.0(1.0) 


9.0 


p=2 


c = 6 


c = 10 


c = 20 


c = 30 


c(n,p) 


n = 25 


98.7(411.8) 


92*2(239.5) 


59.3 (145.2) 


39.6(55.1) 


540.6 


n = 50 


51.1 (238.5) 


19.4 (53.0) 


38(7.3) 


1.8(3.2) 


78.7 


n=100 


12.3 (66.4) 


1.8(4.5) 


1.0(1.1) 


1.0(1.0) 


19.2 


n = 200 


1.6(8.6) 


1.0(1.3) 


1.0 (1.0) 


1.0(1.0) 


10.4 


p = 3 


c = 6 


c = 10 


c = 20 


c = 30 


c(n,p) 


n = 50 


195.1 (788.6) 


179.7 (645.2) 


103.0(279.6) 


18.5 (50.6) 


199.9 


n= 100 


79.3 (292.3) 


9.1 (48.9) 


1.2(1.9) 


1.0(1.0) 


35.2 


n = 200 


4.6(39.8) 


1.0(1.2) 


1.0(1.0) 


1.0(1.0) 


13.0 


n = 300 


1.3 (7.1) 


1.0(1.3) 


1.0(1.0) 


1.0(1.0) 


10.1 



4*2 Comparison of Methods 

The performance of FPC analysis was compared to two other procedures 
from the literature by means of a Monte Carlo simulation, namely 

Maximum Likelihood Ciusterwise Linear Regression (MLCLR) as ex- 
plained by DeSarbo and Cron (1988). They assume a fixed sequence of 
regressor values xi, . . . , x n and model 3/1, . . . ,y n as independently dis- 
tributed according to 

k 

k 

where ej > 0, j = 1, . . . , fc, and ^ ej = 1. The ej denote the propor- 

i=i 

tions of the A; mixture components. They compute Maximum Likelihood 
estimators for the parameters (c^^a*), j = l t . . . under fixed k 
by use of the EM-algorithm, which also provides estimators ly, i = 
1, . . . , n, j = 1, for the probability that the point (x^, j/j), con- 

ditional on its value, was generated by the mixture component j. The 
point (xj,yj) can then be classified as belonging to component 
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:= arg max{€y}. Wedel and DeSarbo (1995) propose the Consistent 

3 

Akaike's Information Criterion (CAIC) of Bozdogan (1987) to estimate 
the number of mixture components k. I applied the algorithm with esti- 
mators from a random partition of the data points as starting values for 
the parameter estimators, an upper bound of 7 for k and a lower bound of 
10~ 6 for the a], j = 1, — , A; (otherwise the likelihood function would 
be unbounded). The algorithm was terminated when the increase of the 
log-likelihood function fell below 10~ 7 . I implemented the method as 
described in the statistical programming language R. 

Model Based Gaussian Clustering with Noise (MBGCN) as implemented in 
the software package MCLUST. A current version is treated in Fraley and 
Raftery (1998, 1999). They assume the points Z( = (x*, yj),tsl,... l n, 
as i.i.d. distributed according to 

k 

£(zi) = e 0 Uc + ]T ^jA^Sp 

where Uc denotes the uniform distribution on a convex set C f a j € IR p+l > 
Ej positive definite (p-fl) x (p+ l)~covariance matrices for j =. 1, . . . y k> 

k 

6j > 0, j = 0, . . . , fc, and y^ gj — 1. The x^-values from R p do not 

include a component for the regression intercept in this setup. Such a 
Gaussian mixture model can also be applied to linear regression data, 
because a linear regression distribution is a p + 1-variate Gaussian dis- 
tribution if the distribution of the independent variables is assumed to be 
a p-variate Gaussian. DasGupta and Raftery (1998) propose the method 
for "highly linear" data. The mixture component Uc is designed to model 
noise or outliers not belonging to any of the Gaussian components. The 
covariance matrices Ej can be decomposed as E^ = AjD; AjDj, where 
Xj is the largest eigenvalue of Ej, Dj is the matrix of eigenvectors, 
Aj = diag(l,a2j, . . . ,<*( P +i)j)- MCLUST is able to fit several mod- 
els defined by various restrictions to A, A and the D ; -matrices. They are 
given in Table 1 of Fraley and Raftery (1999). The software computes 
Maximum Likelihood estimators using the EM-algorithm for the param- 
eters 6 0 , (ei, ai, Ei), . . . , (e*, a*, E*) from starting values computed by 
hierarchical clustering as explained in Fraley and Raftery (1999). The 
component memberships of the points can be estimated by analogy to the 
MLCLR procedure. The Bayesian Information Criterion BIC (Schwarz 
1978) was used for the estimation of the number of components k as well 
as for the choice of an optimally restricted model. 
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An initial estimation of the noise component is needed. Fraley and Raftery 
(1998) propose the use of the software NNclean as explained in Byers 
and Raftery (1998). This software requires the choice of a constant K for 
the number of nearest neighbors of a point involved in the calculations. I 
used K — 10. k < 7 was again assumed. A lower bound for the covari- 
ance determinant (to bound the likelihood function) and a convergence 
criterion were used as implemented in MCLUST. The R-port version of 
MCLUST was used. The Splus-version of NNclean, which worked on 
R with only one small modification, was used. 

Fixed Point Cluster analysis (FPCA) was carried out as described in Section 
3.6. 

Obviously the procedures differ with respect to their underlying models. 
MBGCN assumes Gaussian regressor distributions. The MLCLR model does 
not contain an outlier component, and it assumes the probability of (x*, y») to 
be generated by mixture component j to be constant, regardless of x*. This 
assumption is called "assignment independence", see Hennig (2000b). It will 
be illustrated by the discussion of the simulated data constellations. The theory 
of FPCA (Hennig 2000c) is valid for any distribution on IE?* 1 but it is no exact 
estimation procedure for Gaussian mixture components. Instead, it estimates 
theoretical FPCs. Their existence is guaranteed corresponding to subpopula- 
tions of the type (1), where the rest of the data may be distributed arbitrarily, 
but well separated from the linear regression part That is, I compare proce- 
dures that clearly do not estimate the same features of the data. However, all 
the methods may be applied to the same data with similar interpretations of the 
results. Therefore it is interesting to study the performance of their procedures 
in cases where the assumptions are not fulfilled, but where one may consider 
the methods as appropriate. 

I have chosen four different constellations of n, p, the /3 , a 2 , G-parameters 
and noise for this paper. The simulations indicate that the results depend strongly 
upon all the parameter choices. An arbitrary "ranking" of the procedures could 
easily be illustrated by the choice of the appropriate constellation. I do not at- 
tempt to show that FPCA is generally better than the ML-methods, but at least 
there are some situations where it is superior. 

The constellations are: 

Square-p2: n = 200, p = 2, all a^-values were generated by W(o,i]' x 2 := 
For the points 1-100: y = x\ + u. For the points 101-200: y =s 0.5^2 + 
t*, C(u) = .A/b.o.oooi for all points. See Figure 2, left side. The assump- 
tions of MLCLR are met because the data contain only linear regression 
clusters and the distribution of the independent variable does not vary be- 
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Figure 2. Typical data from the constellations "Square", "3+hoise", and M 2+noise". 

tween clusters. The assumptions of MBGCN are not met because of the 
non-Gaussian regressor distribution. 

Square-pl: The data sets of this constellation were generated as the data of 
the constellation Square, but the values of 22 were not included, thus 
n — 200, p =5 1. This is data with a linear and a nonlinear cluster. It 
meets only the assumptions of FPCA with respect to the remaining linear 
subpopulation (note that FPCA allows the rest of the data to be distributed 
arbitrarily). 

3+Noise: n == 100, p = 1. Points 1-40 were generated by C(x\) = -Afo.o.OQ, 
y = xi + 11, points 41-60 were generated by C{x\) = -A/s.o.09, V = 
-xi + 5 + u, points 61-90 were generated by C{x\) — M.5,o.25 ? y = 
1 -fix, = Mj.o.oi for points 1-90. The points 91-100 were generated 
by W[-i,6)x[-i,2]- See Figure 2 t middle. The assumptions of MBGCN 
are met, approximately even that of an equal shape parameter A for all 
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Table 3: Average maximum similarity of found cluster and average number of found clusters 
nc for constellations "Square-pl" and "Square-p2" 



Data 


Square-p2 


Square-pl 


Method 


Pt. 1-100 


101-200 


nc 




1-100 


101-200 


nc 


MBGCN 


0.461 


0.457 


5.67 




0.856 


0.616 


4.10 


MLCLR 


0.983 


0.979 


2.10 




0.973 


0.652 


3.93 


FPCA 


0.943 


0.961 


3.90 




0.944 


0.523 


2.24 


Table 4: Average maximum similarity a* of nearest found cluster and average number of found 


clusters nc for constellations "3+ Noise" and "2+Noise" 








Data 


3+Noise 


2+Noise 


Method 


Pt, 1-40 


41-60 61-90 




n c 


1-40 


41-75 


nc 


MBGCN 


0.990 


0.980 0.973 


3.05 


0.942 


0.969 


2.55 


MLCLR 


0.694 


0.714 0.529 


3.04 


0.934 


0.914 


3.68 


FPCA 


0.960 


0.787 0.780 


3.82 


0.969 


0.960 


2.98 



clusters. The assumptions of MLCLR are not met because of the noise 
and a strong violation of assignment independence: The domains of the 
regressors are nearly disjoint for the three clusters. 

2+Noi$e: n = 81, p = 1. Points 1-40 were generated by xi = £(x») = 
Mo,u V = #i + % — A/o,o.ooo4- Points 41-75 were generated 
by C(x) = jf 2 .$,u V = 2 + u y C{u) = A/" o ,o.O025- Points 71*73 were 
generated by x = 6, y = 2 + u, C(u) = See Figure 2, right side. 
This is a constellation of two regression clusters and six outliers. Neither 
the assumptions of MLCLR, nor those of MBGCN are met because the 
regressor distribution of the first cluster is not Gaussian, the noise is not 
uniform, and the regressor distributions of the two clusters differ. 

There were 200 simulation runs for each method with each constellation. 
The results are given in the Tables 3 and 4. The estimated number of clusters 
n c (meaning the number n r of found representative FPCs in the case of FPC 
analysis) was recorded as well as the maximum similarity 5* between an esti- 
mated cluster and the given clusters of the constellation, nc does not include 
the noise component in the case of MBGCN. The estimated number n r of FPCs 
cannot be interpreted in the same manner as for the ML-methods, because 

• FPCs may intersect or include each other (in particular there is frequently 
an FPC containing the whole dataset; in constellation "3+noise", some- 
times the union of mixture components 2 and 3 forms an FPC, which 
makes some sense, see Figure 2, middle), and 

• the number of found clusters of the other two methods was limited by 
seven. 
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Therefore* n r can be expected to be slightly larger than the number of 
mixture components. 

The average maximum similarity (over the simulation runs) was used 
as a measure of how good the methods discovered the given clusters. $* was 
discussed in Section 3.4. I prefer this measure to mean (squared) errors of 
parameter estimators for two reasons: 

• The methods aim to find the same clusters, but they do not estimate the 
same parameters. 

• If a method fails essentially to find a cluster, it is not interesting if a pa- 
rameter error is of size 1 or 1000, but such differences influence crucially 
the mean error of the simulation. 

Where the model assumptions of one of the ML methods were fulfilled, the cor- 
responding method led to the best results, as one should expect: MLCLR was 
best for "Square-p2'\ MBGCN was best for "3+Noise". In both cases, FPCA 
yielded better results than the misspecified ML-method. For the constellation 
"2+Noise" where the model assumptions of both ML methods were violated, 
FPCA led to the best results. 

Both MBGCN and MLCLR did not mainly suffer from the oudiers. 
MBGCN was more sensitive against strong non-normality of the independent 
variable in the constellations "Square", and MLCLR against the violation of the 
assignment independence in "3+Noise". 

The performance of FPCA depends upon the size and the separateness 
of the cluster, as can be seen in "3+Noise". The second cluster, which is the 
smallest, and the third cluster, which has the largest error variance, are more 
difficult to find than a large, well separated cluster. 

After subtracting one for the frequently occurring FPC of the whole 
dataset, the number n r of FPCs can keep abreast of the ML methods as an esti- 
mator of the number of "cluster-shaped" mixture components. At "Square-pl", 
it is clearly the best. 

5. Application to Tone Perception Data 

The tone perception data stem from an experiment of Cohen (1984). A 
pure fundamental tone was played to a trained musician. Electronically gen- 
erated overtones were added, determined by a stretching ratio of x. x = 2.0 
corresponds to the harmonic pattern usually heard in traditional definite pitched 
instruments. The musician was asked to tune an adjustable tone to the oc- 
tave above the fundamental tone, y gives the ratio of the adjusted tone to the 
fundamental, Le. y = 2.0 would be the correct tuning for all x-values. The 
data analyzed here belong to 150 trials with the same musician. In the original 
study, there were four further musicians. The scatterplot suggests that there are 
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Figure 3. Partitions of the tone perception data from MLCLR and MBGCN. 



two different linear regression populations, roughly corresponding to (some- 
what biased) correct tuning and tuning to the first overtone, respectively, and 
some points that do not fit to any of them. 

The application of MLCLR and MBGCN, as explained in Section 4, led 
to the partitions shown in Figure 3. MLCLR found six clusters. Among them 
were two clusters corresponding to correct tuning, and to overtone tuning, re- 
spectively. The occurrence of outliers not exacdy consistent to the two most 
obvious lines caused the appearance of four further clusters. This is similar to 
the simulation results of "Square-pl" and "2+Noise", see Table 3 and 4, where 
MLCLR had been able to find the relevant clusters, but the number of clusters 
had frequently been overestimated in the presence of points not consistent with 
any linear regression line. A better fit might be possible with more (or better 
chosen) starting configurations for the EM-algorithm. De Veaux (1989) per- 
formed ML-estimation in the same model, assuming two mixture components 
instead of estimating their number. The result of MBGCN corresponds to the 
optical impression almost perfectly. There are the two main clusters and a noise 
component denoted by "+". This is a very useful result for a clusterwise linear 
regression, but it might be a little bit puzzling, however, from the viewpoint of 
the estimation of 2-dimensional Gaussian mixtures, because of the clear gap in 
the x-values between 1.6 and 1.9. The reason for points with large and small 
x-values to appear in the same clusters is that the distribution of the x-values 
for x > 1.9 is strongly right-skewed. 

FPC analysis as described in Section 3.6 with i n , p = 853, c(n,p) = 
10.07 found twelve FPCs forming three Single Linkage groups. The three rep- 
resentative FPCs with 122, 74, and 66 points, are shown in Figure 5. Their 
parameters (/? 0 ,/?i,cr 2 ) are (1.905,0.048,0.0028), (0.023,0.991,0.0002), and 
(0.794,0.602,0.0011). The numbers of times found (and expectation ratios 
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Figure 4, Representative FPCs of the tone perception data. 



rw) of the Single Linkage groups are 681 (1.49), 141 (1.41), and 10 (0.14). 
Seven points belong to none of the representative FPCs. The FPCs 1 and 2 
correspond to the two clear lines (correct tuning and overtone tuning). Most of 
the points with x « 2.0 are consistent with both of the lines and are assigned 
to both clusters. The third FPC may appear somewhat surprising, but there is 
some justification for it, because the 59 points with x « 2.0 as a whole may be 
seen as a kind of compromise between correct and overtone tuning, and as such 
they are consistent with some points with larger x, which are not fitted well by 
the two main populations. However, the small value of rw = 0.14 indicates a 
lower significance of this cluster compared to the others. A valid interpretation 
may be that the musician does not only tend always to hear the regular tone or 
the overtone as dominant, but sometimes, with x > 1.9 he is confused in such 
a way that he tunes his tone between the two possibilities. The points with ap- 
proximately ordinary overtones (x w 2.0) cannot discriminate clearly between 
the three tendencies. 
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In conclusion, the FPCA solution gives a new insight to the data, which 
is useful if the aim of the analysis is exploratory, while it does not provide esti- 
mators for the parameters of a mixture model (especially not for the component 
proportions) because of the essential intersection of the clusters. 

Note that the result of FPCA as well as that of MLCLR depends on 
chance because of the random selection of starting configurations. Applica- 
tions of FPC analysis to other datasets (illustrating the same tendencies of the 
method) can be found in Hennig (1997, 1999, 2000c). 

6. Conclusion 

Fixed Point Cluster analysis was defined for clusterwise linear regres- 
sion. An algorithm was developed, which aims to find the most significant 
clusters in the data with high probability and prevents meaningless FPCs. Some 
tuning constants are required. Their choice was discussed in detail. FPCA was 
compared with ML-estimators for a linear regression mixture model (MLCLR) 
and a model for p + 1-dimensional Gaussian mixtures (MBGCN/MCLUST) 
by means of a simulation study. Each ML-estimator performed best when 
its model assumptions were fulfilled. FPCA was usually better than an ML- 
estimator with strongly violated assumptions, Outliers cause MLCLR often to 
overestimate the number of clusters. Furthermore MLCLR suffers sometimes 
from violations of the assumption of independence of the assignment of points 
to mixture components of the x-values. MBGCN does not always lead to rea- 
sonable regression clusters if the independent variable deviates strongly from 
a Gaussian distribution. It does not appear to be very sensitive to non-uniform 
outiiers. 

The spirit of FPCA is more exploratory than that of ML-estimation of 
mixture model parameters. The ability of intersecting clusters allows for new 
insights in the data and enables FPCA not to be affected by identifiability prob- 
lems of regression mixtures (Hennig, 2000b). A problem with the FPCA is the 
exponential increase of the computing time with the dimension p. 

The principle of FPC analysis can be transferred to clustering problems 
apart from clusterwise linear regression, see Hennig (1998). The C-software 
f ixreg can be obtained from: 

http://www.math.uni-hamburg.de/home/hennig/ 
An R/S-plus-module is in preparation and will be available soon from the 
same web-site. The C-software needed about 8 seconds for a dataset from con- 
stellation "Square-p2 M of Section 4.2 on a Sun UltraSPARC-Hi 333 MHz, while 
the R-function needed 22 minutes. The tone perception data were analyzed 
much faster (the R-function needed 3 minutes). 
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Appendix 

Proof of Theorem 3. L The proof is divided into three parts. Part 2 contains the 
main idea of the proof: Step 2 of the FPA should get the error variance a 2 (W*) 
small by adding points with squared residual < ca 2 (W*) and excluding those 
with larger residuals. Because c > 1, the inclusion of new points does not 
necessarily decrease the error variance. Part 2 of the proof will show that it 
can be multiplied by an appropriate factor 7r„(w*) such that the product T = 
7Tn(w>)* 2 (W*) is always decreased. 7r n(W «.) is a product of n(W*) factors 
smaller than I. That is, the FPA tries to unite many points (t^w*) small) with 
small <7 2 (W*). Part 1 shows £(W*) to be always uniquely defined. 
Some notation: 

fit ■= ftw fc ), 4 ■= * 2 (w*), 

M fc :=(y-Xi8 fc )'W fc (y-X/y = 
= (n(W*) -p - 1)4, 7r m := UT= P l + i Pi> 

p^=Kf^< 1 )( 1 -^) +1 (^^ 1 )^ 

W+ := max(W* +1 - W*,0), W" := max(W fc - W* +1 ,0), 

the maximum taken element-wise. W + and W~ indicate the data points that 
ace added, removed respectively, by step 2 of the FPA. Assume a\ > 0 for all 
k up to part 3. 

Part 1: Show V& : n(W fc ) > p + 1 by complete induction, which means that 
0* is uniquely defined Vfc. Recall n(W°) > p + 1 and show for m > 0 : 
n(W m ) > p + 1 =» n(W m+1 ) > p + 1. By definition 

|{: : w™ = 1 A ( yi - x</3 m ) 2 > ca 2 m }\ = n(W") 
* o* > #££t => n(W-) < ^±^1- (ID 
Assuming n(W m ) > p + c + 1: 

n(W m+1 ) > n(W») - n(W) > (l - i) n(W m ) + > 
> (l- 0 (p + c+1) + £ 7 i = P + Op+1. 
On the other hand n(W) € J7V and with (11): 

n(W m ) < p + c+ 1 =» 1 > n(W") = 0 => n(W m+1 ) > n(W m ). 
Part 2: Show that 



T : diag({0, l} n ) (0, oo):Wh 7r„ (w) a 2 (W) (12) 
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is strictly decreased by step 2 of the FPA unless n(W + ) = n(W") = 0, i.e., 
w *+i _ W fc Therefore, no W m with W m+l # W m can be repeated during 
the FPA, and the FPA converges in a finite number of steps because there are 
only finitely many indicator vectors of length n. 

Assume that n(W+) > 0, or n(W") > 0 and show T(W* +1 ) - 
T(W fc ) < 0. 

T(W* +1 ) - T(W fc ) = 7r n(w . +l) ^ +1 - * n( w)4 - 

/ ^w>+ti("(W*)-p-l) m \ 2 . 

- ^ < b(Ww)-p-1 *»(W») J <7 fc+ 

+X$gft^(M k+l -M k ). (13) 

Yield M k+ i = 

min(y - X^* +1 )'W*+ 1 (y - Xfi k+l ) < 

P 

<( y -x^)'w fc +'(y-x^)< 

< M k + n(W+)c^ - n(W-)cal (14) 

by definition of W + and W. If n(W~) > 0, there is strict "<". Hence with 
(13): 

T(W* +1 ) - T(W fc ) < 

„ / *„,w>+M(*(WV P -l+NW+)-n(W-)]c) \ a 

< ( — 1 ■ n ( W ^.)-p-i : ^(W*) ) <7 k (15) 

=s g. Letd := n(W+) - n(W") = n(W* +1 ) - n(W*). If d = 0 then 

Mw>) _ n(W fc ) - p - 1 4- dc 
7r R(wk+l) „(W*+l)-p-l 

and therefore 9 = 0 and T(W fc+1 ) - T(W*) < 0 (there is strict inequality in 
(15) because d - 0 and W fc # W* +1 imply n(W") > 0). Show further 

j<0 ,m> g^ ( ,6) 

T«(W»+>) n(W*+ x )-p-l 

which implies 9 < 0 and again T(W* +1 ) - T(W fc ) < 0 by strict inequality in 
(15), and 

d>0 ^^^> "^-'- 1+ f (17) 
7r n(W * + i) n(W*+ l )-p-l 
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implying q < 0 and strict decrease of T. 
Proof of (16): If nCW** 1 ) < p H- c, then 

n(W *)-p~l+<ic (C-I)d * B ( W ») 

n(W*+ l )-j>-l ^n(W fc «)^p-l - irn^ww)' 
On the other hand, with n(W*) > n(W* +1 ) > p + c, get 

y n(w*») _ n n(W fc )-l /- _ C ~A ^ 

7T n(wfc+l) ~ 1U(WHI) V A ~ 

Use (1 - 6) m > 1 - fern for 0 < 6 < l,m G W. Let b = B j W ^|, H , 
rn = -d Then 

*"n(W*) - , c-1 _ n(W*)-p-l+<te 
^n(ww) ~ n(W*+ 1 ). - p - 1 n(W*+ l ) ~p~l " 
Proof of (17): By complete induction over m > 0 get for b > 0, m E iZV: 

1 " ^imc* > " Ifm) (c) 

Vmi > m2 € Wo where mi + mi = m (18) 
assuming c > 1 and j— ^ < 1. (19) 

In particular! - ^gs = > J > (if. (20) 

Start with n(W* +1 ) > p + c-1. Because n(W* +1 ) > n(W*) > p + 1, 
apply (18) with m = d, mi = n(W fe+1 ) - max(n(W fc ), \p + c - lj + 1), 
6 = n (W fc ) -p - 1. (19) is fulfilled because g£ = ggw&ppi < 1. 

n(W'+>)-l 

jr n(w)t) 11 Pi 

t=n(W») 

n (i-h 1 ) n ^ 

s(i- w *b=t)-(i)*"< 

. , (c-l)d _ n (W»+')-p-l 

< 1 ~ n(^l)-p-l+<fc _ „(wl)-p-l+ < te' 

i.e., (17). With n(W* +1 ) < p + c - 1 get 

frn(W»+') 



_ (\\ d ^ n(W+')-p-l 
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because of (20). 

Part 3: Assume g\ = 0 for some k. > 0 ^ n(W*) > p -f 1 because of 
part i. Further, W f = 1 => - x^) 2 = 0, - 1 <=> (yi - ^ k f - 0 
and n(W k + l ) > n(W*) > p + 1. Hence a| +a = o\ = 0, ^ fc+1 = 0 fcf 
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